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The fastest known algorithms for the solution of a large elliptic boundary value problem on a 
massively parallel hypercube all require 0(log(n)) floating point operauons and 0(log(n)) 
distance- 1 communications, if we define massively parallel to mean a number of processors 
proportional to the size n of the problem. The algorithm TIMA (for Totally Parallel Multilevel 
Algorithm) that we describe below has, as special cases, four of these fast algorithms. These four 
algorithms are PSMG (Parallel Superconvergent Multigrid) of Frederickson and McBryan, Robust 
Multigrid of Hackbusch, the FFT based Spectral Algorithm, and Parallel Cyclic Reduction. The 
algorithm TPMA, when described recursively, has four steps: 

( 1) Project to a collection of interlaced, coarser problems at the next lower leveL 

(2) Apply TPMA, recursively, to each of these lower level problem, solving directly at the 
lowest leveL 

(3) Interpolate these approximate solutions to the finer grid, and a verage them to form an 
approximate solution on this grid. 

(4) Refine this approximate solution with a defect-correction step, using a local approximate 
inverse. 

Choice of the projection operator P, the interpolation operator Q, and the smoother S determines 
the class of problems on which TPMA is most effective. There are special cases in which the first 
three steps produce an exact solution, and the smoother is not needed (e.g. constant coefficient 
oeprators). 


Key Words: Multilevel algorithm; Multigrid; Cyclic reduction; Spectral method . 
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1: Introduction. 

The fullest known algorithms for the solution of an 
elliptic boundary value problem 

Au = v (1.1) 

on a massiwly parallel hypercubc, by which wc mean a 
hypercube with a number p of processors proportional 
to the size n of the problem, are ail very closely related 
in structure. It is immediately apparent that all pro- 
ceed in 0(log(n)) stages (or levels) consisting of 0(n) 
floating point operations executed in parallel and O(p) 
parallel communications with a nearest neighbor. On 
closer examination one observes that in the k th level of 
any of these algorithms the problem 

A k u k - v h (1.2) 

is, in effect, being solved, and that this problem actually 
consists in a number of independent and interleaved sub- 
problems. It is this observation that we wish to clarify 
in the following sections. 

The algorithm TPMA (for Totally Parallel Multi- 
level Algorithm) that wc describe below has, as special 
cases, four of thes^ fast algorithms. These four algo- 



rithms are PSMG (Parallel Superconvergent Multigrid) 
of Frederickson and McBryan, Robust M ultigrid of Hack- 
busch, the FFT based Spectral Algorithm, and Parallel 
Cyclic Reduction. Choice of the projection operator P , 
the interpolation operator Q k , and the smoothing op- 
erutnr .S'* in the algorithm 'PPM A determines which of 
these particular algorithms is represented. Which algo- 
rithm one wishes to use depends on many things, among 
them the characteristics of the problem (l.l). What new 
algorithms fill the space between these known ones is yet 
to be determined. 

It is useful, when attempting to understand the al- 
gorithm TPMA, lo see clearly the intertwined subgrids 
at each of the multiple levels of the algorithm. These are 
more easily visualized in the case of plane triangulations, 
and hence we begin with these. 

2: Graph Colorings and Suhgrids. 

The nodes of an arbitrary planar graph can be col- 
ored using at most four colors with no two adjacent nodes 
having the same color. Fewer colors may suffice. For ex- 
ample, the graph in Fig. 1 corresponding to an equilat- 
eral triangulation of the plane requires only three colors. 

If we connect the nodes of the same color that are a 
graph distance two apart wc form three interlaced sub- 
grids, each larger by & factor >/3 and rotated, as shown 
in Fig. 2. Observe that each subgrid corresponds to a 
subgraph of the square of the original graph. This is 
equivalent to the statement that nodes connected by an 
edge in the subgrid connect points that are separated by 
two edges of the original graph. 

Similarly, the graph of the familiar five point Lapla- 
cian requires only two colors. The two subgrids that 
are formed by connecting nodes of like color which are a 
graph distance two apart in the original grid are larger 
by the factor y/2 . Use of this grid offers some advantage 
over the triangular grid when computing on a binary 
hypercube, for communication distance along each grid 
axis remains a power of two as the algorithm descends 
through the levels. Ilypercuhe communication distance 
is therefore bounded by four. 


Work reported herein was supported in part by Cooperative Agreement NCC2-387 between the Na- 
tional Aeronautics and Space Administration (NASA) and the Universities Space Research Association 
(USRA) 
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Fig. 2 


When using TPM A to solve an elliptic problem dis- 
cretized onto an unstructured triangular grid we are able 
to generate interlaced snbgrids using an imperfect three- 
coloring, in which the number of adjacent nodes of the 
same color is minimized, or an imperfect two-coloring. 
This construction of interlaced snbgrids is repeated re- 
cursively: each of these subgrids is a triangulation in its 
own right. 

Finally, we observe that the graph of the nine-point 
Laplacian, although not planar, is colorable in four col- 
ors. The four interleaved subgrids, each larger by the 
factor 2 = y/i, lead to a very high performance imple- 
mentation of TPM A on highly parallel hypercubes, and 
the CM2 in particular. 


ing equation at level k — 1. The easiest example of a 
projection operator, P* = J, is used in some versions of 
the algorithm TPM A. In this case the first step in pro- 
jecting equation (3.1) to the interleaved subgrids at level 
Jb — 1 is particularly easy. 

Suppose that we have been able, somehow, to con- 
struct an approximate solution u* 1 in X h 1 to the 
equation = v k ~ l . Then we will use the 

interpolation operator Q k to map it into an approximate 
solution u k — of equation (3.1). 

The effect of Q* is to combine the approximate so- 
lutions from all of the interleaved subgrids of a given grid 
into one approximate solution on that grid. Except at 
the highest level this grid will, in turn, be one of several 
interleaved subgrids of a grid at a yet higher level. In 
many cases Q k is best described as an averaging opera- 
tor, while in other cases it will simply be the identity. 

The convergence theory is particularly easy to state 
when the two operators Q h and P fc are adjoint or dual to 
each other. This is the situation, for example, when they 
arc constructed using the Raleigh-Ritz-Calerkin proce- 
dure. 

In most cases the operator A*" 1 at level — 1 is 
defined recursively using 

A*- 1 = P k A h Q\ .(3.2) 

or, equivalently, via the commutative diagram 



3: TPMA: Totally Parallel Multilevel Algorithm. 

Definition of the algorithm TPMA requires specifi- 
cation of the two operators Q k and P* for every level 
0 < k < m, and the operators A k and S h for 0 < k < m. 
Consider first the projection operator P k y k — ♦ y k ~ l 
which uses the data v* in the equation 

- v* (3.1) 

to construct the data v k ~ l = P k v k of the correspond- 


X k -il y* 

[<?* J P k (3.3) 

The task now is to solve the system (3.1) at every 
level k. At level k ■= 0 this is easier than at any other 
level, for the original system has been reduced to as many 
independent systems as possible. In many cases each 
of these system is only a scalar equation, and solution 
requires only a division. In any case we will denote the 
solution operator by 5°. 

In general the approximate solution u k = 
is not an exact solution to eqn. (3.1), and it is advan- 
tageous to use a local approximate inverse P k to the 
operator A* in a defect- correction step 

u* -u* + P*( v k - A k u h ) (3.4) 

after interpolation. In some cases the Jacobi operator 
(the reciprocal of the diagonal of the operator A fc ) is 
adequate ns a defect correction operator .9*, but in many 
cases it is worthwhile to take into account more of the 
structure of A k when constructing S h , 






The algorithm TPM A that we have described im- 
plicitly above may now be defined explicitly to be a rep- 
resentation of the operator T k given recursively by 

T k = s * + ( / - S k A h )Q k T h ’ x S k (3.7) 

with T° = 5° as initial condition. When the four 

operators satisfy certain inequalities (see [4] for details) 
one can prove that T k is an approximate inverse to A h . 

We will see that there are versions of TPMA for 
which u h — Q k u h ~ l is an exact solution to equation 
(3.1), rather than just an approximate solution. In this 
case the local approximate inverse S h is not needed, or 
may be taken to be 0, except at the bottom level k = 0 
where we assume that S°A° = J. Then the recursive 
definition of T h reduces to T h = Q* T *-ip* p or the 
commutative diagram 

,v* , T - y* 

[<?* |p‘ (3.6) 


4: FSMG: Parallel Superconvergcnt Multigrid. 

When solving an elliptic problem discretized onto a 
given grid (or graph) it is useful to have an approximate 
solution on a coarser grid, for this may be interpolated 
onto the given grid to serve as the initial approxima" 
tion of a defect correction algorithm. This idea leads 
one to the classic multigrid algorithm, perhaps the most 
obvious example of a multilevel algorithm. In multigrid, 
recursively coarser grids are used, with the lowest level 
grid having so few points that the corresponding approx- 
imate elliptic problem can be solved directly. These al- 
gorithms are multilevel, but not totally parallel. In fact, 
almost all of the processors are standing idle almost all 
of the time. Isn’t there something useful that these idle 
processors could do to further the solution? 

This is the question that Oliver McBryan and the 
author asked themselves, and that led them to develop 
the totally parallel multilevcHalgorithm PSMG [4,5]. 
The essential idea of PSMG is to project the given 
problem onto all of the available coarser grids, forming 
enough lower level problems to keep all of the processors 
busy all of the time. This is done recursively, as in or- 
dinary multigrid. The payoff is much faster convergence 
at no added computational cost. In fact, the program is 
somewhat simpler on the Connection Machine, for it is 
no longer necessary to turn more of the processors off at 
every step. 


5: The FFT Based Spectral Method. 

The classical spectral algorithm for the solution of 
a constant coefficient elliptic, problem on a rectangular 
domain has three steps: take the FFT of the problem 
data v, divide this by the transform of the differential 
operator, and transform back. We may, however, equally 
well describe it as an example of the algorithm TPMA 
in which the operator P k is given by 

P*0 = -5=( 0 1 «*’•) (5.1) 

v 2 

when 0 < ( * mod 2 m ** l+l ) < 2 m_ *, and 

p\ = 4?( i - w * < * °) ( 5 - 2 > 

v2 

otherwise. This three point operator works on points a 
distance 2 m ”* apart, which are the points of the suhgrid 
at that level. This is the operation of P h in the first 
dimension, and is followed by a pair of operators trans- 
verse to these in each of the other dimensions. Q k is the 
adjoint of P h t and S k = 0 except at the lowest level, 
where 5° is the reciprocal of the fourier transform of the 
differential operator. 

An important observation is that the Fourier mul- 
tipliers w** need be computed only once, on the high- 
est level. Those coefficients needed at the next lower 
level are a Hamming distance at most two away, and are 
moved in at the start of the computation at that level. 
For a more detailed discussion of FFT implementations 
on highly parallel computers see the recent papers of 
Kamin and Adams [l] and Schwarztrauber et. al. [9]. 

6: Parallel Cyclic Reduction. 

The cyclic reduction (odd-even reduction) algo- 
rithm of Buneman [2] and Hockney [6] for solving a tridi- 
agonal, or block tridiagonal, system of equations of the 
form 

(A u)i = + U< + Ot.i + ltti + t ■ = *i (6.1) 

is another example of the algorithm TPMA. Here the 
projection operator P h is defined by 

P k vi = — * (6.2) 

where j denotes 2 m ~*, and the lower-level operators A k 
are defined, recursively, by 

= P k A k (6.3) 

In the standard version of this algorithm, optimal on a 
sequential computer, the number of equations reduces 
by a factor of two at every step. After log(n) steps, 


but only 0(n) operations, a single system remains, and 
this is solved directly. Log(n) stages of back substitution 
remain to be done before the solution is known over the 
whole array. 

The totally parallel version of this algorithm (Hock- 
ney and Jcssope [7|) projects at every stage onto all 
nodes of the grid using the same projection operator 
(6.2). The advantage is that the solution is known at 
all nodes after the first log(n) stages. It is just twice as 
fast, therefore, as ordinary cyclic reduction on a suffi- 
ciently parallel computer. This is the version of cyclic 
reduction that is a special case of TPMA. To see this, 
observe that eqn. (6.3) is equivalent to eqn. (3.2) with 
Q k - T, which is what pnrnllcl cyclic reduction requires. 
Because Interpolation is exact, wc take the TPMA op- 
erator S* = 0 in this case. For a fuller discussion of 
parallel cyclic reduction and related direct multilevel al- 
gorithms the recent paper of Swartztrauber and Sweet 
[8] is recommended. 

Sum in ary. 

We describe the algorithm TPMA (Totally Pnrallel 
Multilevel Algorithm) and demonstrate that three of the 
fastest known algorithms for solving an elliptic boundary 
value problem on a highly parallel hypercube, such as the 
Connection Machine of Thinking Machines Corporation, 
are special cases of TPMA. These are the FFT based 
spectral algorithm, parallel cyclic reduction, and PSMG. 
Each of these appears to be optimal in certain situations. 
Since all are special cases of the same algorithm TPMA 
it may be possible to combine the advantages, and form 
hybrid algorithms that are better than any of these in 
particular problem domains. 
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